Note to the reader

Markdown comments added by the student/author (John Leonard) are highlighted in red. The final section of the document (Section 7) contains the informal written report.



Your Task:

FROM: Danielle Sherman
Subject: Brand Preference Prediction

Hello,

The sales team has again consulted with me with some concerns about ongoing product sales in one of our stores. Specifically, they have been tracking the sales performance of specific product types and would like us to redo our previous sales prediction analysis, but this time they’d like us to include the ‘product type’ attribute in our predictions to better understand how specific product types perform against each other. They have asked our team to analyze historical sales data and then make sales volume predictions for a list of new product types, some of which are also from a previous task. This will help the sales team better understand how types of products might impact sales across the enterprise.

I have attached historical sales data and new product data sets to this email. I would like for you to do the analysis with the goals of:

  • Predicting sales of four different product types: PC, Laptops, Netbooks and Smartphones
  • Assessing the impact service reviews and customer reviews have on sales of different product types

When you have completed your analysis, please submit a brief report that includes the methods you employed and your results. I would also like to see the results exported from R for each of the methods.

Thanks,

Danielle

existingproductattributes2017
newproductattributes2017



Plan of Attack

Introduction

Your Task

You have been asked by Danielle Sherman, CTO of Blackwell Electronics, to predict the sales of four different product types while assessing the effects service and customer reviews have on sales. You’ll be using regression to build machine learning models for this analysis, using your choice of several popular algorithms. Once you have determined which one works best on the provided data set, Danielle would like you to predict the sales of the four product types from the new products list and prepare a report of your findings.

This task requires you to prepare one deliverable for Danielle Sherman:

Sales Prediction Report. A report in a Zip file that includes:

  • A brief summary in Word or PowerPoint of your methods and results that include:
    • The algorithms you tried.
    • The algorithm you selected to make the predictions, including a rationale for selecting the method you did and the level of confidence in the predictions.
    • Your sales predictions for four target product types found in the new product attributes data set
    • A chart that displays the impact customer and service reviews have on sales volume.
  • The results of each model you constructed, exported from R

The steps in the following tabs will walk you through this process.


Setup

#Load the libraries & set random seed
library(caret)
library(readr) 
set.seed(1)
#load the data set
df <- read_csv("existingproductattributes2017.csv");
Parsed with column specification:
cols(
  ProductType = col_character(),
  ProductNum = col_double(),
  Price = col_double(),
  x5StarReviews = col_double(),
  x4StarReviews = col_double(),
  x3StarReviews = col_double(),
  x2StarReviews = col_double(),
  x1StarReviews = col_double(),
  PositiveServiceReview = col_double(),
  NegativeServiceReview = col_double(),
  Recommendproduct = col_double(),
  BestSellersRank = col_double(),
  ShippingWeight = col_double(),
  ProductDepth = col_double(),
  ProductWidth = col_double(),
  ProductHeight = col_double(),
  ProfitMargin = col_double(),
  Volume = col_double()
)

1. Pre-Process the Data

In previous regression tasks, you needed to remove non-numeric features to make predictions; however, typical datasets don’t contain only numeric values. Most data will contain a mixture of numeric and nominal data, so we need to understand how to incorporate both when developing regression models and making predictions.

Categorical variables may be used directly as predictor or predicted variables in a multiple regression model as long as they’ve been converted to binary values. In order to pre-process the sales data as needed we first need to convert all factor or ‘chr’ classes to binary features that contain ‘0’ and ‘1’ classes. Fortunately, caret has a method for creating these ‘Dummy Variables’ as follows:

# one-hot encode (dummify) the data
df_preprocessed <- dummyVars(" ~.",data = df)
df_preprocessed <- data.frame(predict(df_preprocessed,newdata = df))
colnames(df_preprocessed)
 [1] "ProductTypeAccessories"      "ProductTypeDisplay"         
 [3] "ProductTypeExtendedWarranty" "ProductTypeGameConsole"     
 [5] "ProductTypeLaptop"           "ProductTypeNetbook"         
 [7] "ProductTypePC"               "ProductTypePrinter"         
 [9] "ProductTypePrinterSupplies"  "ProductTypeSmartphone"      
[11] "ProductTypeSoftware"         "ProductTypeTablet"          
[13] "ProductNum"                  "Price"                      
[15] "x5StarReviews"               "x4StarReviews"              
[17] "x3StarReviews"               "x2StarReviews"              
[19] "x1StarReviews"               "PositiveServiceReview"      
[21] "NegativeServiceReview"       "Recommendproduct"           
[23] "BestSellersRank"             "ShippingWeight"             
[25] "ProductDepth"                "ProductWidth"               
[27] "ProductHeight"               "ProfitMargin"               
[29] "Volume"                     

Correlation

Correlation, as you likely already know, is a measure of the relationship between two or more features or variables. In this problem, you were tasked with ascertaining whether some specific features have an impact on weekly sales volume.

  1. In order to measure the correlation between the variables in the data, none of the variables may contain nominal data types. Use str() to check all of the data types in your dataframe.
str(df_preprocessed) #Check data structure
'data.frame':   80 obs. of  29 variables:
 $ ProductTypeAccessories     : num  0 0 0 0 0 1 1 1 1 1 ...
 $ ProductTypeDisplay         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypeExtendedWarranty: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypeGameConsole     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypeLaptop          : num  0 0 0 1 1 0 0 0 0 0 ...
 $ ProductTypeNetbook         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypePC              : num  1 1 1 0 0 0 0 0 0 0 ...
 $ ProductTypePrinter         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypePrinterSupplies : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypeSmartphone      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypeSoftware        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductTypeTablet          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ProductNum                 : num  101 102 103 104 105 106 107 108 109 110 ...
 $ Price                      : num  949 2250 399 410 1080 ...
 $ x5StarReviews              : num  3 2 3 49 58 83 11 33 16 10 ...
 $ x4StarReviews              : num  3 1 0 19 31 30 3 19 9 1 ...
 $ x3StarReviews              : num  2 0 0 8 11 10 0 12 2 1 ...
 $ x2StarReviews              : num  0 0 0 3 7 9 0 5 0 0 ...
 $ x1StarReviews              : num  0 0 0 9 36 40 1 9 2 0 ...
 $ PositiveServiceReview      : num  2 1 1 7 7 12 3 5 2 2 ...
 $ NegativeServiceReview      : num  0 0 0 8 20 5 0 3 1 0 ...
 $ Recommendproduct           : num  0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
 $ BestSellersRank            : num  1967 4806 12076 109 268 ...
 $ ShippingWeight             : num  25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
 $ ProductDepth               : num  23.9 35 10.5 15 12.9 ...
 $ ProductWidth               : num  6.62 31.75 8.3 9.9 0.3 ...
 $ ProductHeight              : num  16.9 19 10.2 1.3 8.9 ...
 $ ProfitMargin               : num  0.15 0.25 0.08 0.08 0.09 0.05 0.05 0.05 0.05 0.05 ...
 $ Volume                     : num  12 8 12 196 232 332 44 132 64 40 ...
  2. Now use summary() to check for missing data. Missing data is represented by “NA”. There are many methods of addressing missing data, but for now let’s delete any attribute that has missing information.
summary(df_preprocessed)
 ProductTypeAccessories ProductTypeDisplay ProductTypeExtendedWarranty
 Min.   :0.000          Min.   :0.0000     Min.   :0.000              
 1st Qu.:0.000          1st Qu.:0.0000     1st Qu.:0.000              
 Median :0.000          Median :0.0000     Median :0.000              
 Mean   :0.325          Mean   :0.0625     Mean   :0.125              
 3rd Qu.:1.000          3rd Qu.:0.0000     3rd Qu.:0.000              
 Max.   :1.000          Max.   :1.0000     Max.   :1.000              
                                                                      
 ProductTypeGameConsole ProductTypeLaptop ProductTypeNetbook ProductTypePC 
 Min.   :0.000          Min.   :0.0000    Min.   :0.000      Min.   :0.00  
 1st Qu.:0.000          1st Qu.:0.0000    1st Qu.:0.000      1st Qu.:0.00  
 Median :0.000          Median :0.0000    Median :0.000      Median :0.00  
 Mean   :0.025          Mean   :0.0375    Mean   :0.025      Mean   :0.05  
 3rd Qu.:0.000          3rd Qu.:0.0000    3rd Qu.:0.000      3rd Qu.:0.00  
 Max.   :1.000          Max.   :1.0000    Max.   :1.000      Max.   :1.00  
                                                                           
 ProductTypePrinter ProductTypePrinterSupplies ProductTypeSmartphone
 Min.   :0.00       Min.   :0.0000             Min.   :0.00         
 1st Qu.:0.00       1st Qu.:0.0000             1st Qu.:0.00         
 Median :0.00       Median :0.0000             Median :0.00         
 Mean   :0.15       Mean   :0.0375             Mean   :0.05         
 3rd Qu.:0.00       3rd Qu.:0.0000             3rd Qu.:0.00         
 Max.   :1.00       Max.   :1.0000             Max.   :1.00         
                                                                    
 ProductTypeSoftware ProductTypeTablet   ProductNum        Price        
 Min.   :0.000       Min.   :0.0000    Min.   :101.0   Min.   :   3.60  
 1st Qu.:0.000       1st Qu.:0.0000    1st Qu.:120.8   1st Qu.:  52.66  
 Median :0.000       Median :0.0000    Median :140.5   Median : 132.72  
 Mean   :0.075       Mean   :0.0375    Mean   :142.6   Mean   : 247.25  
 3rd Qu.:0.000       3rd Qu.:0.0000    3rd Qu.:160.2   3rd Qu.: 352.49  
 Max.   :1.000       Max.   :1.0000    Max.   :200.0   Max.   :2249.99  
                                                                        
 x5StarReviews    x4StarReviews    x3StarReviews    x2StarReviews   
 Min.   :   0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:  10.0   1st Qu.:  2.75   1st Qu.:  2.00   1st Qu.:  1.00  
 Median :  50.0   Median : 22.00   Median :  7.00   Median :  3.00  
 Mean   : 176.2   Mean   : 40.20   Mean   : 14.79   Mean   : 13.79  
 3rd Qu.: 306.5   3rd Qu.: 33.00   3rd Qu.: 11.25   3rd Qu.:  7.00  
 Max.   :2801.0   Max.   :431.00   Max.   :162.00   Max.   :370.00  
                                                                    
 x1StarReviews     PositiveServiceReview NegativeServiceReview Recommendproduct
 Min.   :   0.00   Min.   :  0.00        Min.   :  0.000       Min.   :0.100   
 1st Qu.:   2.00   1st Qu.:  2.00        1st Qu.:  1.000       1st Qu.:0.700   
 Median :   8.50   Median :  5.50        Median :  3.000       Median :0.800   
 Mean   :  37.67   Mean   : 51.75        Mean   :  6.225       Mean   :0.745   
 3rd Qu.:  15.25   3rd Qu.: 42.00        3rd Qu.:  6.250       3rd Qu.:0.900   
 Max.   :1654.00   Max.   :536.00        Max.   :112.000       Max.   :1.000   
                                                                               
 BestSellersRank ShippingWeight     ProductDepth      ProductWidth   
 Min.   :    1   Min.   : 0.0100   Min.   :  0.000   Min.   : 0.000  
 1st Qu.:    7   1st Qu.: 0.5125   1st Qu.:  4.775   1st Qu.: 1.750  
 Median :   27   Median : 2.1000   Median :  7.950   Median : 6.800  
 Mean   : 1126   Mean   : 9.6681   Mean   : 14.425   Mean   : 7.819  
 3rd Qu.:  281   3rd Qu.:11.2050   3rd Qu.: 15.025   3rd Qu.:11.275  
 Max.   :17502   Max.   :63.0000   Max.   :300.000   Max.   :31.750  
 NA's   :15                                                          
 ProductHeight     ProfitMargin        Volume     
 Min.   : 0.000   Min.   :0.0500   Min.   :    0  
 1st Qu.: 0.400   1st Qu.:0.0500   1st Qu.:   40  
 Median : 3.950   Median :0.1200   Median :  200  
 Mean   : 6.259   Mean   :0.1545   Mean   :  705  
 3rd Qu.:10.300   3rd Qu.:0.2000   3rd Qu.: 1226  
 Max.   :25.800   Max.   :0.4000   Max.   :11204  
                                                  
#drop columns that contain NA
drops <- c("BestSellersRank")
df_preprocessed <- df_preprocessed[,!(names(df_preprocessed) %in% drops)]
names(df_preprocessed)
 [1] "ProductTypeAccessories"      "ProductTypeDisplay"         
 [3] "ProductTypeExtendedWarranty" "ProductTypeGameConsole"     
 [5] "ProductTypeLaptop"           "ProductTypeNetbook"         
 [7] "ProductTypePC"               "ProductTypePrinter"         
 [9] "ProductTypePrinterSupplies"  "ProductTypeSmartphone"      
[11] "ProductTypeSoftware"         "ProductTypeTablet"          
[13] "ProductNum"                  "Price"                      
[15] "x5StarReviews"               "x4StarReviews"              
[17] "x3StarReviews"               "x2StarReviews"              
[19] "x1StarReviews"               "PositiveServiceReview"      
[21] "NegativeServiceReview"       "Recommendproduct"           
[23] "ShippingWeight"              "ProductDepth"               
[25] "ProductWidth"                "ProductHeight"              
[27] "ProfitMargin"                "Volume"                     
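As an aside, deleting the BestSellersRank column is the simplest option, but it discards a potentially useful predictor. A minimal alternative is imputation; the sketch below is illustrative only (it is not used in this analysis) and fills NA entries with the column median:

```r
# Illustrative alternative to dropping a column with NAs:
# replace each missing value with the column median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy example: the median of the observed values 1, 3, 5 is 3
v <- c(1, NA, 3, NA, 5)
impute_median(v)  # 1 3 3 3 5
```

Applied to the original data this would be df$BestSellersRank <- impute_median(df$BestSellersRank), at the cost of biasing that column toward its median.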
  3. While correlation doesn’t always imply causation, you can start your analysis by finding the correlation between the relevant independent variables and the dependent variable. In the next steps you will use the cor() function to create a correlation matrix that you can visualize to ascertain the correlation between all of the features.
  4. Use the cor() function to build the correlation matrix:
df_corr <-cor(df_preprocessed)

Correlation values fall between -1 and 1: variables with strong positive relationships have values closer to 1, and variables with strong negative relationships have values closer to -1. What kind of relationship do two variables with a correlation of ‘0’ have?

A correlation of 0 corresponds to no linear relationship between the two columns being correlated.
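To see this concretely, two vectors generated independently of one another have a sample correlation near, but rarely exactly, zero; a quick base-R sketch:

```r
set.seed(1)
x <- rnorm(1000)  # standard-normal noise
y <- rnorm(1000)  # drawn separately, so no underlying relationship
cor(x, y)         # a small value near 0, nonzero only due to sampling noise
```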

It is often very helpful to visualize the correlation matrix with a heat map so we can ‘see’ the impact different variables have on one another. To generate a heat map for your correlation matrix, we’ll use the corrplot package as follows:

#install.packages("corrplot")
library(corrplot)
corrplot(df_corr,order="hclust",tl.col="black", tl.srt=90,tl.cex = .45)

Blue (cooler) colors show a positive relationship and red (warmer) colors indicate a negative relationship. Knowing what you do about correlation, what do you think intersections in the chart without colors represent?

A correlation of 0 corresponds to no linear relationship between the two columns being correlated.

Using the heat map, review the service and customer review relationships with sales volume and note the associated correlations for your report. If you would like more detailed correlation figures than those available with the heat map, enter the name of your correlation object into console and review the printed information.

Now that you know the relationships between all of the variables in the data it is a good time to remove any features that aren’t needed for your analysis.

# Search for cross correlations > 0.95 and < 1 that aren't related to the label_column ("Volume")
label_column <- "Volume"
drops <- c(label_column)
df_corr_abs <- abs(df_corr)
df_corr_abs <- df_corr_abs[,!(colnames(df_corr_abs) %in% drops)] #drop the label column so you don't remove features correlated with this label
for (col_name in c(colnames(df_corr_abs))){
  df_column <-df_corr_abs[,col_name]
  df_strong_corr<-df_column[df_column>0.95]
  if (length(df_strong_corr)>0) {
    print(col_name)
    print(df_strong_corr)
  }
}
[1] "ProductTypeAccessories"
ProductTypeAccessories 
                     1 
[1] "ProductTypeDisplay"
ProductTypeDisplay 
                 1 
[1] "ProductTypeExtendedWarranty"
ProductTypeExtendedWarranty 
                          1 
[1] "ProductTypeGameConsole"
ProductTypeGameConsole 
                     1 
[1] "ProductTypeLaptop"
ProductTypeLaptop 
                1 
[1] "ProductTypeNetbook"
ProductTypeNetbook 
                 1 
[1] "ProductTypePC"
ProductTypePC 
            1 
[1] "ProductTypePrinter"
ProductTypePrinter 
                 1 
[1] "ProductTypePrinterSupplies"
ProductTypePrinterSupplies 
                         1 
[1] "ProductTypeSmartphone"
ProductTypeSmartphone 
                    1 
[1] "ProductTypeSoftware"
ProductTypeSoftware 
                  1 
[1] "ProductTypeTablet"
ProductTypeTablet 
                1 
[1] "ProductNum"
ProductNum 
         1 
[1] "Price"
Price 
    1 
[1] "x5StarReviews"
x5StarReviews        Volume 
            1             1 
[1] "x4StarReviews"
x4StarReviews 
            1 
[1] "x3StarReviews"
x3StarReviews 
            1 
[1] "x2StarReviews"
x2StarReviews x1StarReviews 
     1.000000      0.951913 
[1] "x1StarReviews"
x2StarReviews x1StarReviews 
     0.951913      1.000000 
[1] "PositiveServiceReview"
PositiveServiceReview 
                    1 
[1] "NegativeServiceReview"
NegativeServiceReview 
                    1 
[1] "Recommendproduct"
Recommendproduct 
               1 
[1] "ShippingWeight"
ShippingWeight 
             1 
[1] "ProductDepth"
ProductDepth 
           1 
[1] "ProductWidth"
ProductWidth 
           1 
[1] "ProductHeight"
ProductHeight 
            1 
[1] "ProfitMargin"
ProfitMargin 
           1 
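The manual loop above works, but caret also ships a helper, findCorrelation(), that performs the same search and suggests which member of each highly correlated pair to drop. A self-contained sketch on a toy matrix (the 0.95 cutoff matches the loop; to inspect this data set, pass df_corr instead):

```r
library(caret)

# Toy demonstration: b is a near-duplicate of a, z is independent noise
set.seed(1)
a <- rnorm(100)
b <- a + rnorm(100, sd = 0.01)
z <- rnorm(100)
corr_matrix <- cor(data.frame(a, b, z))

# Flag one column from each pair with |correlation| above the cutoff
findCorrelation(corr_matrix, cutoff = 0.95, names = TRUE)
```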
# Delete one of the pairs from each correlation
drops <- c("x1StarReviews","x5StarReviews")
df_preprocessed<-df_preprocessed[,!(colnames(df_preprocessed) %in% drops)] 
#Transform Volume to log10_Volume to prevent predictions of <0 volume
df_preprocessed<-df_preprocessed[!(df_preprocessed$Volume==0),] #remove zero-volume rows first, since log10(0) is -Inf
df_preprocessed['log10_Volume'] <- log10(df_preprocessed$Volume)
drops <- c("Volume")
df_preprocessed<-df_preprocessed[,!(colnames(df_preprocessed) %in% drops)] 
#Visualize the data
plot_summary_of_data<-function(DatasetName,x_index=1){
  
  column_names = names(DatasetName)
  
  subplot_cols = 2
  subplot_rows = 2
  par(mfrow=c(subplot_rows,subplot_cols))  
  
  x <- unlist(DatasetName[,x_index])
  x_header = column_names[x_index]
  
  for(i in 1:length(column_names)){
    
    if(i != x_index) {
    y <- unlist(DatasetName[,i])
    y_header = column_names[i]
    
    try(plot(x,y, xlab = x_header, ylab = y_header),silent=TRUE)  #Scatter (Box) Plot
    } 
  }
}
plot_summary_of_data(df_preprocessed,x_index=26)
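One consequence of the log10 transform above: the models will now predict log10_Volume, so their outputs must be raised back to powers of 10 before reporting sales volumes, which also guarantees the reported volumes are positive. A minimal sketch with hypothetical prediction values:

```r
# Hypothetical model outputs on the log10 scale
log10_pred <- c(1, 2, 3.5)

# Invert the transform: 10^x is always positive, so no negative volumes
volume_pred <- 10^log10_pred
volume_pred  # 10, 100, ~3162
```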



2. Develop Multiple Regression Models

In this step you will build models, make predictions and learn which algorithms are appropriate for parametric and non-parametric data sets.

  1. Using the steps outlined in the ‘R Walkthrough’ for training a linear model, create a linear model that uses volume as its dependent variable. Use R’s summary() function to evaluate the model and make specific note of the R-squared value.
#setup seed for reproducibility
set.seed(1)
# Define Label
y <- df_preprocessed$log10_Volume
#define a 75-25% train-test split of the dataset
inTraining <- createDataPartition(y, p = .75, list = FALSE)
df_train <- df_preprocessed[inTraining,]
df_test <- df_preprocessed[-inTraining,]
y_train = df_train$log10_Volume
y_test = df_test$log10_Volume
#check dimensions of train & test set
dim(df_train); dim(df_test);
[1] 58 26
[1] 19 26
View(df_train)
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(log10_Volume ~., data = df_train, method = "lm", trControl=train_controls)
Warning (repeated): prediction from a rank-deficient fit may be misleading
print(model)
Linear Regression 

58 samples
25 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 52, 51, 53, 51, 53, 52, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.7816035  0.6272193  0.5384831

Tuning parameter 'intercept' was held constant at a value of TRUE
cat('\n df_train post resample: \n')

 df_train post resample: 
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
prediction from a rank-deficient fit may be misleading
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
     RMSE  Rsquared       MAE 
0.2491830 0.8933307 0.1890489 
cat('\n df_test post resample: \n')

 df_test post resample: 
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
prediction from a rank-deficient fit may be misleading
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
      RMSE   Rsquared        MAE 
1.36678107 0.02705989 0.95938732 


1. What do you notice about the RMSE and R-Squared values?
On the training set, the RMSE and R-squared are quite good, but the values are poor for the test set, suggesting the model is overfitting. Furthermore, multiple rank-deficiency warnings were thrown during the fit.

2. Did the model perform well? Why or why not?
No; R-squared is very low on the test set.
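The repeated ‘rank-deficient fit’ warning is itself diagnostic: the twelve one-hot ProductType columns always sum to 1, so together with the model’s intercept they are perfectly collinear and lm cannot estimate all coefficients. caret’s findLinearCombos() locates such dependencies; a self-contained sketch on a toy design matrix:

```r
library(caret)

# Toy design: d1 + d2 = 1 in every row, mimicking a full set of dummies
d1 <- c(1, 0, 1, 0)
d2 <- 1 - d1
x  <- cbind(intercept = 1, d1 = d1, d2 = d2, price = c(10, 20, 30, 40))

# Returns the dependent column sets and a suggested column to remove
findLinearCombos(x)
```

Dropping one dummy level per categorical variable, or building the dummies with dummyVars(..., fullRank = TRUE), removes the dependency and silences the warning.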

3. If not, perhaps you used the wrong type of machine learning method on the wrong type of data. See the following resource for more information: Parametric vs non-parametric methods for data analysis

So let’s dive into using some non-parametric machine learning models:

  1. Using the same general approach documented in the walkthrough and the steps outlined below, make sales volume predictions on the new products dataset after training and testing your models on the historical data set:

    1. Set seed and create training and test sets
#setup seed for reproducibility
set.seed(1)
# Define Label
y <- df_preprocessed$log10_Volume
#define a 75-25% train-test split of the dataset
inTraining <- createDataPartition(y, p = .75, list = FALSE)
df_train <- df_preprocessed[inTraining,]
df_test <- df_preprocessed[-inTraining,]
y_train = df_train$log10_Volume
y_test = df_test$log10_Volume
#check dimensions of train & test set
dim(df_train); dim(df_test);
[1] 58 26
[1] 19 26
  1. Use the following 3 algorithms for your analysis; you might have to research each of these as there are variants of each in caret - you may choose which variant you need:

  2. Support Vector Machine (SVM)

       <span style='color:red'> [walkthrough link](http://dataaspirant.com/2017/01/19/support-vector-machine-classifier-implementation-r-caret-package/) </span>
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
#View(df_train)
 
model <- train(log10_Volume ~., data = df_train, method = "svmLinear",
                 trControl=train_controls,
                 tuneLength = 10)
Warning (repeated): Variable(s) `' constant. Cannot scale data.
model_svmLinear <- model
print(model)
Support Vector Machines with Linear Kernel 

58 samples
25 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 52, 51, 52, 53, 54, 53, ... 
Resampling results:

  RMSE      Rsquared   MAE      
  1.299165  0.5567494  0.7928724

Tuning parameter 'C' was held constant at a value of 1
cat('\n df_train post resample: \n')

 df_train post resample: 
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
     RMSE  Rsquared       MAE 
0.3243008 0.8199906 0.2046974 
cat('\n df_test post resample: \n')

 df_test post resample: 
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
     RMSE  Rsquared       MAE 
0.9348012 0.1020904 0.7404686 
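Note that the printout says the cost parameter C was held constant at 1. If you want caret to tune it, you can pass an explicit grid via tuneGrid; the values below are illustrative, not tuned for this data:

```r
library(caret)

# Illustrative cost grid for method = "svmLinear"
svm_grid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100))
svm_grid

# The train() call would then become (sketch, reusing objects defined above):
# model <- train(log10_Volume ~ ., data = df_train, method = "svmLinear",
#                trControl = train_controls, tuneGrid = svm_grid)
```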
  3. Random Forest
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#View(df_train)
 
model <- train(log10_Volume ~., data = df_train, method = "rf",
                 trControl=train_controls,
                 tuneLength = 10)
model_rf <- model
print(model)
Random Forest 

58 samples
25 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 52, 52, 52, 52, 52, 52, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared   MAE      
   2    0.2929948  0.9109390  0.2245404
   4    0.2299453  0.9303843  0.1769531
   7    0.2049799  0.9385415  0.1530137
   9    0.2010769  0.9413503  0.1495050
  12    0.2004596  0.9409190  0.1518874
  14    0.2001748  0.9396720  0.1520021
  17    0.1982224  0.9403279  0.1500803
  19    0.1982045  0.9416796  0.1505520
  22    0.1973757  0.9418469  0.1510513
  25    0.2034560  0.9371050  0.1547379

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 22.
cat('\n df_train post resample: \n')

 df_train post resample: 
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
      RMSE   Rsquared        MAE 
0.08523613 0.98868133 0.05978294 
cat('\n df_test post resample: \n')

 df_test post resample: 
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
     RMSE  Rsquared       MAE 
0.2852245 0.8933604 0.2039424 
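Since the brief asks how service and customer reviews impact sales, the fitted forest can also be queried for variable importance with caret’s varImp(). Below is a self-contained toy sketch whose outcome is built to depend mostly on x1, so x1 should rank highest; on the real model the call is simply varImp(model_rf):

```r
library(caret)
set.seed(1)

# Toy data whose outcome is driven almost entirely by x1
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$y <- 3 * toy$x1 + 0.2 * toy$x2 + rnorm(200, sd = 0.1)

# Fit a single random forest (no resampling, one fixed mtry)
fit <- train(y ~ ., data = toy, method = "rf",
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "none"))

varImp(fit)  # x1 should dominate the importance ranking
```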
  4. Gradient Boosting
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#View(df_train)
 
model <- train(log10_Volume ~., data = df_train, method = "xgbTree",
                 trControl=train_controls,
                 tuneLength = 10)
model_gbTree <- model
print(model)
eXtreme Gradient Boosting 

58 samples
25 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 52, 53, 52, 52, 53, 53, ... 
Resampling results across tuning parameters:

  eta  max_depth  colsample_bytree  subsample  nrounds  RMSE       Rsquared 
  0.3   1         0.6               0.5000000   50      0.2284764  0.9216587
  0.3   1         0.6               0.5000000  100      0.2201977  0.9261489
  0.3   1         0.6               0.5000000  150      0.2143680  0.9272393
  0.3   1         0.6               0.5000000  200      0.2135276  0.9295991
  0.3   1         0.6               0.5000000  250      0.2130879  0.9301313
  0.3   1         0.6               0.5000000  300      0.2130536  0.9302487
  0.3   1         0.6               0.5000000  350      0.2120491  0.9306822
  0.3   1         0.6               0.5000000  400      0.2133532  0.9303621
  0.3   1         0.6               0.5000000  450      0.2134724  0.9298110
  0.3   1         0.6               0.5000000  500      0.2139841  0.9302042
  0.3   1         0.6               0.5555556   50      0.2205741  0.9246225
  0.3   1         0.6               0.5555556  100      0.2225953  0.9257463
  0.3   1         0.6               0.5555556  150      0.2218757  0.9280144
  0.3   1         0.6               0.5555556  200      0.2225147  0.9265356
  0.3   1         0.6               0.5555556  250      0.2268311  0.9254983
  0.3   1         0.6               0.5555556  300      0.2295099  0.9246502
  0.3   1         0.6               0.5555556  350      0.2288024  0.9257733
  0.3   1         0.6               0.5555556  400      0.2292988  0.9255461
  0.3   1         0.6               0.5555556  450      0.2298814  0.9252520
  0.3   1         0.6               0.5555556  500      0.2300765  0.9255009
  0.3   1         0.6               0.6111111   50      0.2438379  0.9205469
  0.3   1         0.6               0.6111111  100      0.2408638  0.9221003
  0.3   1         0.6               0.6111111  150      0.2324788  0.9282627
  0.3   1         0.6               0.6111111  200      0.2274411  0.9309447
  [xgbTree tuning grid omitted for brevity: RMSE, Rsquared, and MAE were printed for each combination of eta, max_depth, colsample_bytree, subsample, and nrounds (50-500); the printout was truncated after reaching getOption("max.print"), with 3875 rows omitted.]

Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 50, max_depth = 10,
 eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and
 subsample = 0.6111111.
cat('\n df_train post resample: \n')

 df_train post resample: 
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
       RMSE    Rsquared         MAE 
0.004438720 0.999970293 0.002037815 
cat('\n df_test post resample: \n')

 df_test post resample: 
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
     RMSE  Rsquared       MAE 
0.3040302 0.8883087 0.2165503 
  1. Be sure to take any precautions needed to guard against overfitting and longer training times

  2. Apply each of your models to your testing data as you have done in the previous task using the predict() function in R.
    1. Example: Predictions<-predict(TrainedModelName, newdata=testSet).
      This was done in-line with the model training
  3. Review your models and identify the one that performed best without overfitting. You should also look at the predicted values themselves. If you have negative values in your predictions and negative values are not possible for your dependent variable, choose a different model. Be prepared to explain why you chose the algorithms you did in your report.
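As a quick sanity check (a sketch, assuming a fitted caret model `model` and the held-out `df_test` from this workflow), you can test for impossible negative predictions directly. Note that because this analysis predicts log10(Volume), negative log-scale predictions are legitimate here and simply correspond to volumes below 1; the check below matters for models trained on the raw Volume label.

```r
# sketch: flag models that predict impossible negative values
preds <- predict(model, newdata = df_test)
any(preds < 0)  # TRUE is a red flag when the label is raw Volume
```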

compare_models <- function(model_list, df_train, df_test, label_column){
    for (i in 1:length(model_list)){
      
      model <- model_list[[i]]
      model_name <- model$method
      
      cat(paste('\n ----- model_name:', model_name, '-----'))
      
      # evaluate on the training set
      cat('\n df_train post resample: \n')
      y_obs <- df_train[, c(label_column)]
      prediction <- predict(model, df_train)
      print(postResample(pred = prediction, obs = y_obs))
    
      plot(y_obs, prediction, xlab = 'Label', ylab = 'Prediction', col = 'blue', main = model_name)
      
      # evaluate on the held-out test set
      cat('\n df_test post resample: \n')
      y_obs <- df_test[, c(label_column)]
      prediction <- predict(model, df_test)
      print(postResample(pred = prediction, obs = y_obs))
      
      points(y_obs, prediction, col = 'red')
      legend('topleft', legend = c('Train', 'Test'), col = c('blue', 'red'), pch = 'o')
    }
}
model_list <- list(model_svmLinear, model_rf, model_gbTree)
compare_models(model_list, df_train, df_test, label_column = 'log10_Volume')

 ----- model_name: svmLinear -----
 df_train post resample: 
     RMSE  Rsquared       MAE 
0.3243008 0.8199906 0.2046974 

 df_test post resample: 
     RMSE  Rsquared       MAE 
0.9348012 0.1020904 0.7404686 

 ----- model_name: rf -----
 df_train post resample: 
      RMSE   Rsquared        MAE 
0.08523613 0.98868133 0.05978294 


 df_test post resample: 
     RMSE  Rsquared       MAE 
0.2852245 0.8933604 0.2039424 

 ----- model_name: xgbTree -----
 df_train post resample: 
       RMSE    Rsquared         MAE 
0.004438720 0.999970293 0.002037815 


 df_test post resample: 
     RMSE  Rsquared       MAE 
0.3040302 0.8883087 0.2165503 

The random forest and xgbTree appear to be the best models; however, the xgbTree seems to be overfitting, based on its near-perfect R-squared score and the nearly perfect linear correlation between the label and the prediction in the label vs. prediction plot, so we will proceed with the random forest as the best model.

  1. After choosing a model, you will need to prepare the new products data set for prediction. Anything that has been done to the structure of the existing products data needs to be repeated for the new products. With new products, use dummyVars() and then remove any attribute that you removed from the existing products data sets. When using dummyVars, be sure to change the name of the object you are creating so that you don't overwrite your earlier work. Example: newDataframe should be changed to newDataframe2 wherever it appears in your dummyVar work.
#load the data set
df_validation <- read_csv("newproductattributes2017.csv");
Parsed with column specification:
cols(
  ProductType = col_character(),
  ProductNum = col_double(),
  Price = col_double(),
  x5StarReviews = col_double(),
  x4StarReviews = col_double(),
  x3StarReviews = col_double(),
  x2StarReviews = col_double(),
  x1StarReviews = col_double(),
  PositiveServiceReview = col_double(),
  NegativeServiceReview = col_double(),
  Recommendproduct = col_double(),
  BestSellersRank = col_double(),
  ShippingWeight = col_double(),
  ProductDepth = col_double(),
  ProductWidth = col_double(),
  ProductHeight = col_double(),
  ProfitMargin = col_double(),
  Volume = col_double()
)
# one-hot encode (dummify) the data
df_validation_preprocessed <- dummyVars(" ~.",data = df_validation)
df_validation_preprocessed <- data.frame(predict(df_validation_preprocessed,newdata = df_validation))
# drop the columns that were removed during preprocessing of the existing products data
drops <- c("BestSellersRank","x1StarReviews","x5StarReviews")
df_validation_preprocessed <- df_validation_preprocessed[,!(names(df_validation_preprocessed) %in% drops)]
#Make predictions with the chosen model (random forest)
model <- model_rf
df_train_test = df_validation_preprocessed
prediction <- predict(model, df_train_test)
prediction_validation <- prediction
prediction_validation_Volume <- 10^(prediction_validation)
#Add predictions to df
df_validation_w_predictions <- df_validation
df_validation_w_predictions['Predicted_Volume'] <- prediction_validation_Volume
#sort the df
df_validation_w_predictions <- df_validation_w_predictions[order(df_validation_w_predictions$Predicted_Volume),]
#Add unique ID column
df_validation_w_predictions['ProductType_ProductNumber_Price']<- with(df_validation_w_predictions, paste0(ProductType,'_#', ProductNum,'_$', Price))
par(mar=c(11,4,1,1))
barplot(height = df_validation_w_predictions$Predicted_Volume, names.arg = df_validation_w_predictions$ProductType_ProductNumber_Price, las=2, cex.axis = .8 , cex.names = 0.8, ylab = 'Volume')
#aggregate by Product Type
df_ProductType_aggregate <- aggregate(df_validation_w_predictions$Predicted_Volume, by=list(Category=df_validation_w_predictions$ProductType), FUN=sum)
colnames(df_ProductType_aggregate) <- c("ProductType", "Total_Predicted_Volume")
# sort the aggregate
df_ProductType_aggregate <- df_ProductType_aggregate[order(df_ProductType_aggregate$Total_Predicted_Volume),]
par(mar=c(9,4,1,4))

barplot(height = df_ProductType_aggregate$Total_Predicted_Volume, names.arg = df_ProductType_aggregate$ProductType, las=2, ylab = 'Total Volume')

#Plot Ratings and Reviews vs. Volume
x<-log10(df_validation_w_predictions$x4StarReviews)
y<-log10(df_validation_w_predictions$Predicted_Volume)
plot(x,y,col='red',xlab = "log10(# of ratings)", ylab = "log10(Predicted_Volume)")
x<-log10(df_validation_w_predictions$x3StarReviews)
points(x,y,col='green')
x<-log10(df_validation_w_predictions$x2StarReviews)
points(x,y,col='blue')
legend(2.1,2.2, legend = list('4 Stars','3 Stars','2 Stars'),col=c('red','green','blue'),pch='o')

#Plot service reviews
x<-log10(df_validation_w_predictions$PositiveServiceReview)
y<-log10(df_validation_w_predictions$Predicted_Volume)
plot(x,y,col='green',xlab = "log10(# of Service Reviews)", ylab = "log10(Predicted_Volume)")
x<-log10(df_validation_w_predictions$NegativeServiceReview)
points(x,y,col='red')
legend(2, legend = list('Positive','Negative'),col=c('green','red'),pch='o')

  1. Once the new products data is prepared, use the predict() function again, this time with the new products dataset, to create your final predictions in an object called finalPred.

Oftentimes it is helpful for report building to output your data set and predictions from RStudio. Let's add your predictions to the new products data and then create a csv file. Use your csv file and Excel to organize your data for reporting.

  • Add predictions to the new products data set

This was done in the previous part

  • Create a csv file and write it to your hard drive. Note: You may need to use your computer’s search function to locate your output file.
write.csv(df_validation_w_predictions, file="C2.T3output.csv", row.names = FALSE)
  1. Use Excel to organize your predictions. Remember the four product types you need to focus on: PC, Laptops, Netbooks and Smartphones

We’ll just stick to organizing the data in R



3. Write an informal report

Write an informal report to Danielle Sherman, in Word or PowerPoint, describing your analysis. In addition to presenting your findings, you might address questions such as the following:



Multiple Regression in R Report

Introduction

In this report we review the multiple regression techniques used to predict sales volumes for new products from Blackwell Electronics. In developing these multiple regression models, we performed a number of preprocessing steps: we eliminated features with high collinearity, scaled the data so all features had similar numeric ranges, one-hot encoded (dummified) categorical string data so the models could leverage these categorical features, and transformed the label of interest, Volume, onto a log scale to prevent the models from ever predicting volumes of less than 0.

Preprocessing

The “existingproductattributes2017.csv” data set was used to build the multiple regression models. After pulling the data into R, the categorical feature, Product Type, was one-hot encoded (dummified). In this step, the “dummyVars” function determines the number of Product Type categories, n_cat, and creates n_cat new columns, one per product type. Each cell in these new columns is populated with a one if that row corresponds to the column's product type, and a zero otherwise. In this way, we transformed string-based categorical data into numeric data, which the machine learning algorithms can leverage as features during training and prediction.
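As an illustration (a toy sketch, not the project data), running `dummyVars` on a small data frame with a hypothetical ProductType factor produces one 0/1 column per category while leaving numeric columns untouched:

```r
library(caret)

# toy data: a factor column plus a numeric column
toy <- data.frame(ProductType = factor(c("PC", "Laptop", "PC")),
                  Price = c(500, 900, 650))
dv <- dummyVars(~ ., data = toy)
predict(dv, newdata = toy)  # one 0/1 indicator column per ProductType level, plus Price
```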

Following the one-hot encoding, we analyzed the data to see if any columns contained “NA” values. “BestSellersRank” was observed to contain 15 NA cells, thus this column was dropped as a relevant feature.
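A quick way to find such columns (assuming `df` is the loaded existing-products data frame) is to count the NA cells per column:

```r
# count missing values in each column; only BestSellersRank should be non-zero
colSums(is.na(df))
```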

Next, we analyzed the correlations in the data sets using a correlation plot, shown below

Correlation Plot Here, the deep blue cells represent strong positive correlations, while the deep red cells represent strong negative correlations. Using the correlation values from this matrix, we filtered out the features with collinearities >0.95 (“x1StarReviews” with “x2StarReviews”). Furthermore, we discovered the “x5StarReviews” feature had a perfect correlation of 1 with the “Volume” label. This is a suspiciously good correlation between a feature and a label, so we analyzed the data in a scatter plot, shown below.

Volume vs x5StarReviews As can be seen, this feature and our label of interest have a perfect correlation, which implies there is likely some data entry error in the “x5StarReviews” feature. For this reason, this column was dropped as a feature for our models.

Following the exclusion of the “x1StarReviews” and “x5StarReviews” features, we transformed the label column (“Volume”) onto a log10 scale. This was done to prevent any of the models from ever predicting negative volumes, since a negative prediction for log10(Volume) simply corresponds to a volume <1 (e.g., 10^-1 = 0.1).
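A small numeric sketch of why this works: back-transforming any real-valued log10 prediction with 10^x always yields a positive volume.

```r
# any real-valued prediction in log space maps to a strictly positive volume
log10_pred <- c(-1, 0, 2.5)
10^log10_pred  # 0.1, 1, ~316.2 -- never negative
```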

Finally, the data was split into a training and testing set using a 75-25% train-test partition.

Training & Testing the Models

Three models were evaluated: (1) a support vector machine (SVM) with a linear kernel, (2) a random forest (RF), and (3) an eXtreme gradient boosted tree (xgbTree). For each model, the train and test RMSE and R-squared were calculated. The table below summarizes the results.

Train Test Metrics


To visualize the results, we also plotted the true log10(Volume) label vs. the log10(Volume) prediction. Label vs. prediction plots Viewing the RMSE & R-squared summary table, along with the label vs. prediction plots, we can see that the RF and xgbTree are the best models. However, comparing these two models, we also see that the xgbTree has a training R-squared nearly equal to one (0.9999) and an almost perfectly linear trend in the label vs. prediction plot on the training set. These two facts suggest this model is overfitting the training data, and thus the RF is the better model for generalization.

Predicting New Product Volumes

Using the trained RF model, we performed predictions for the new products defined in the “newproductattributes2017.csv” data set. Prior to feeding the data into the trained model, we repeated the relevant preprocessing steps (one-hot encoding the product type and dropping the “BestSellersRank”, “x1StarReviews”, and “x5StarReviews” columns). After predicting log10(Volume) for each case in the new product attributes table, we back-transformed the predictions (10^x) to obtain the predicted volumes. The bar chart below shows the breakdown of the total (aggregate) predicted volume vs. Product Type.

Total_Volume_vs_Product_Type Here, we can see Tablets and Game Consoles are expected to have the highest sales volumes. Diving deeper into the data, we can break the products down further by product type, product number, and price. The bar chart below shows this breakdown.

Volume_breakdown From these, we can more clearly see which individual products are expected to have the highest sales volumes. Specifically, we see Tablet #187, sold at $199, accounts for the majority of the total volume sold by tablets, while Game Consoles #307 and #199 contribute nearly equally to the total game console volume sold.

These conclusions have two key business implications: (1) if the sales objective is to minimize the number of products while maximizing sales volume, then focusing on Tablet #187 is the best course of action; (2) if the sales objective is to offer the widest range of product types while maximizing sales volume, then the team should focus on PC #17, Tablet #186, Smartphone #194, Netbook #180, Game Consoles #307 and/or #199, and Tablet #187.

Finally, the last result that may be of interest to the sales team is the impact of customer ratings and service reviews on volume. The scatter plot below shows the predicted volume vs. the number of ratings for 4-star, 3-star, and 2-star ratings, all on a log-log scale.

Volume_breakdown Here, we can see that there is essentially a linear relationship between log10(# of ratings) and log10(predicted Volume).

Similar to the # of ratings, we also see a linear relationship between the log10(Volume) and log10(# of service reviews (positive and negative)), as can be seen in the plot below. Volume_breakdown

Task 2 Opinions/Comments

Overall I found this to be the most challenging of the tasks we have completed, largely because it required more individual exploration rather than following the plan of attack line by line. That said, I think this style of activity was very educational. The part I had the most trouble with was running the initial linear model, as the errors the model was throwing were somewhat strange and there wasn't a consistent answer online as to what they actually mean. Other than that, I found it pretty straightforward to rerun predictions using different models, though I wish R had better “function” capabilities, more similar to Python, because I found myself copying and pasting lines of code rather than dealing with the unique characteristics of R functions.

---
title: 'Task 3: Multiple Regression in R'
output:
  html_notebook: default
  pdf_document: default
---

***
***
## Plan of Attack
### Introduction
#### Your Task

This task requires you to prepare one deliverable for Danielle Sherman:

Sales Prediction Report. A report in a Zip file that includes:

* A brief summary in Word or PowerPoint of your methods and results that include:
     * The algorithms you tried. 
     * The algorithm you selected to make the predictions, including a rationale for selecting the method you did and the level of confidence in the predictions.
     * Your sales predictions for four target product types found in the new product attributes data set
     * A chart that displays the impact of customer and service reviews have on sales volume. 
* The results of each model you constructed, exported from R
The steps in the following tabs will walk you through this process.

***
***
### <span style='color:red'> Setup </span>
```{r}
#Load the libraries & set random seed
library(caret)
library(readr) 
set.seed(1)

#load the data set
df <- read_csv("existingproductattributes2017.csv");
```
### 1. Pre-Process the Data
In previous Regression tasks, you needed to remove non-numeric features to make predictions; however, typical datasets don’t contain only numeric values. Most data will contain a mixture of numeric and nominal data so we need to understand how to incorporate both when it comes to developing regression models and making predictions. 

Categorical variables may be used directly as predictor or predicted variables in a multiple regression model as long as they've been converted to binary values. In order to pre-process the sales data as needed we first need to convert all factor or 'chr' classes to binary features that contain ‘0’ and ‘1’ classes. Fortunately, caret has a method for creating these 'Dummy Variables' as follows:

```{r}
# one-hot encode (dummify) the data
df_preprocessed <- dummyVars(" ~.",data = df)
df_preprocessed <- data.frame(predict(df_preprocessed,newdata = df))

colnames(df_preprocessed)
```
#### Correlation

Correlation, as you likely already know, is a measure of the relationship between two or more features or variables. In this problem, you were tasked with ascertaining whether some specific features have an impact on weekly sales volume.

1. In order to measure the correlation between the variables in the data, none of the variables may contain nominal data types. Use str() to check all of the data types in your dataframe. 
```{r}
str(df_preprocessed) #Check data structure
```

2. Now use summary() to check for missing data. Missing data is represented by "NA". There are many methods of addressing missing data, but for now let's delete any attribute that has missing information. 
```{r}
summary(df_preprocessed)
```
```{r}
# drop columns that contain NA
drops <- c("BestSellersRank")
df_preprocessed <- df_preprocessed[,!(names(df_preprocessed) %in% drops)]

names(df_preprocessed)

```
3. While correlation doesn't always imply causation you can start your analysis by finding the correlation between the relevant independent variables and the dependent variable. In the next steps you will use the cor() function to create a correlation matrix that you can visualize to ascertain the correlation between all of the features.
4. Use the cor() function to build the correlation matrix:
```{r}
df_corr <-cor(df_preprocessed)
```
Correlation values fall between -1 and 1, with variables having strong positive relationships showing correlation values closer to 1 and strong negative relationships showing values closer to -1. What kind of relationship do two variables with a correlation of '0' have?

<span style="color:red"> 0 correlation corresponds to no linear relationship between the two columns being correlated. </span>
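A quick numerical illustration of this (a standalone sketch, not part of the task data): two independently generated variables show a correlation close to 0.

```{r}
# two independent random variables: correlation near zero, no linear relationship
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
cor(x, y)  # close to 0
```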

It is often very helpful to visualize the correlation matrix with a heat map so we can 'see' the impact different variables have on one another. To generate a heat map for your correlation matrix we'll use corrplot package as follows:
```{r}
#install.packages("corrplot")
library(corrplot)
corrplot(df_corr,order="hclust",tl.col="black", tl.srt=90,tl.cex = .45)
```
Blue (cooler) colors show positive relationships and red (warmer) colors indicate negative relationships. Knowing what you do about correlation, what do you think the intersections in the chart without color represent?

<span style="color:red"> 0 correlation corresponds to no linear relationship between the two columns being correlated. </span>

Using the heat map, review the service and customer review relationships with sales volume and note the associated correlations for your report. If you would like more detailed correlation figures than those available with the heat map, enter the name of your correlation object into the console and review the printed information.
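For example, one way to pull those detailed figures (assuming the `df_corr` matrix built above) is to rank the features by the strength of their correlation with Volume:

```{r}
# rank features by absolute correlation with the Volume label
vol_corr <- df_corr[, "Volume"]
sort(abs(vol_corr[names(vol_corr) != "Volume"]), decreasing = TRUE)
```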

Now that you know the relationships between all of the variables in the data it is a good time to remove any features that aren't needed for your analysis.
```{r}
# Search for cross correlations > 0.95 and < 1 that aren't related to the label_column ("Volume")
label_column <- "Volume"
drops <- c(label_column)
df_corr_abs <- abs(df_corr)
df_corr_abs <- df_corr_abs[,!(colnames(df_corr_abs) %in% drops)] # drop the label column so you don't remove features correlated with the label
for (col_name in c(colnames(df_corr_abs))){
  df_column <-df_corr_abs[,col_name]
  df_strong_corr<-df_column[df_column>0.95]
  if (length(df_strong_corr)>0) {
    print(col_name)
    print(df_strong_corr)
  }
}
```
```{r}
# Delete one of the pairs from each correlation
drops <- c("x1StarReviews","x5StarReviews")
df_preprocessed<-df_preprocessed[,!(colnames(df_preprocessed) %in% drops)] 

#Transform Volume to log10_Volume to prevent predictions of <0 volume
df_preprocessed<-df_preprocessed[df_preprocessed$Volume != 0,] # drop zero-volume rows before taking logs
df_preprocessed['log10_Volume'] <- log10(df_preprocessed$Volume)
drops <- c("Volume")
df_preprocessed<-df_preprocessed[,!(colnames(df_preprocessed) %in% drops)] 
```

```{r}
#Visualize the data
plot_summary_of_data<-function(DatasetName,x_index=1){
  
  column_names = names(DatasetName)
  
  subplot_cols = 2
  subplot_rows = 2
  par(mfrow=c(subplot_rows,subplot_cols))  
  
  x <- unlist(DatasetName[,x_index])
  x_header = column_names[x_index]
  
  for(i in 1:length(column_names)){
    
    if(i != x_index) {
    y <- unlist(DatasetName[,i])
    y_header = column_names[i]
    
    try(plot(x,y, xlab = x_header, ylab = y_header),silent=TRUE)  #Scatter (Box) Plot
    } 
  }
}

plot_summary_of_data(df_preprocessed,x_index=26)
```


***
***
### 2. Develop Multiple Regression Models
In this step you will build models, make predictions and learn which algorithms are appropriate for parametric and non-parametric data sets.

1. Using the steps outlined in the 'R Walkthrough' for training a linear model, create a linear model that uses volume as its dependent variable. Use the summary() function in R to evaluate the model and make specific note of the R-squared value.
```{r}
# set seed for reproducibility
set.seed(1)

# Define Label
y <- df_preprocessed$log10_Volume

#define a 75-25% train-test split of the dataset
inTraining <- createDataPartition(y, p = .75, list = FALSE)
df_train <- df_preprocessed[inTraining,]
df_test <- df_preprocessed[-inTraining,]

y_train = df_train$log10_Volume
y_test = df_test$log10_Volume

#check dimensions of train & test set
dim(df_train); dim(df_test);

#View(df_train)
```
```{r}
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(log10_Volume ~., data = df_train, method = "lm", trControl=train_controls)
print(model)

cat('\n df_train post resample: \n')
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)

cat('\n df_test post resample: \n')
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
```
</br>
     1. What do you notice about the RMSE and R-Squared values?
     </br>
      <span style='color:red'> On the training set, the RMSE and R-squared are quite good, but the values are poor for the testing set, suggesting the model is overfitting. Furthermore, multiple errors were thrown during the fit. </span>
</br>      
     2. Did the model perform well? Why or why not? 
     </br>
     <span style='color:red'> No, the R-squared is very low for the testing set. </span>
</br>     
     3. If not, perhaps you used the wrong type of machine learning method on the wrong type of data. See the following resource for more information: [Parametric vs non-parametric methods for data analysis](https://s3.amazonaws.com/gbstool/courses/883/docs/ParametricNonparametric_Altman_2009.pdf?AWSAccessKeyId=AKIAJBIZLMJQ2O6DKIAA&Expires=1550480400&Signature=PRY9DUNqxdpRErkrhPTBOgBAOzc%3D)
     
So let's dive into using some non-parametric machine learning models:

1. Using the same general approach documented in the walkthrough and the steps outlined below, make sales volume predictions on the new products dataset after training and testing your models on the historical data set:

     1. Set seed and create training and test sets
```{r}
# set seed for reproducibility
set.seed(1)

# Define Label
y <- df_preprocessed$log10_Volume

#define a 75-25% train-test split of the dataset
inTraining <- createDataPartition(y, p = .75, list = FALSE)
df_train <- df_preprocessed[inTraining,]
df_test <- df_preprocessed[-inTraining,]

y_train = df_train$log10_Volume
y_test = df_test$log10_Volume

#check dimensions of train & test set
dim(df_train); dim(df_test);
```
    
2. Use the following 3 algorithms for your analysis; you might have to research each of these as there are variants of each in caret - you may choose which variant you need:
     
   1. Support Vector Machine (SVM)
           
           <span style='color:red'> [walkthrough link](http://dataaspirant.com/2017/01/19/support-vector-machine-classifier-implementation-r-caret-package/) </span>
```{r}
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

#View(df_train)
 
model <- train(log10_Volume ~., data = df_train, method = "svmLinear",
                 trControl=train_controls,
                 tuneLength = 10)
model_svmLinear <- model
print(model)

cat('\n df_train post resample: \n')
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)

cat('\n df_test post resample: \n')
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
```
           
  2. Random Forest
```{r}
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#View(df_train)
 
model <- train(log10_Volume ~., data = df_train, method = "rf",
                 trControl=train_controls,
                 tuneLength = 10)
model_rf <- model
print(model)

cat('\n df_train post resample: \n')
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)

cat('\n df_test post resample: \n')
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
```

  3. Gradient Boosting
      
```{r}
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#View(df_train)
 
model <- train(log10_Volume ~., data = df_train, method = "xgbTree",
                 trControl=train_controls,
                 tuneLength = 10)
model_gbTree <- model
print(model)

cat('\n df_train post resample: \n')
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)

cat('\n df_test post resample: \n')
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
```
      
           
2. Be sure to take any precautions needed to guard against overfitting and longer training times

3. Apply each of your models to your testing data as you have done in the previous task using the predict() function in R. 
</br>
     1. Example: Predictions<-predict(TrainedModelName, newdata=testSet).
     </br>
     <span style='color:red'> This was done in-line with the model training </span>
     
4. Review your models and identify the one that performed best without overfitting. You should also look at the predicted values themselves. If you have negative values in your predictions and negative values are not possible for your dependent variable, choose a different model. Be prepared to explain in your report why you chose the algorithms you did.

```{r}

compare_models <- function (model_list, df_train, df_test, label_column){
    for (i in 1:length(model_list)){
      
      model <- model_list[[i]]  # [[ ]] extracts the train object itself, not a one-element list
      model_name <- model$method
      
      cat(paste('\n ----- model_name:',model_name,'-----'))
      cat('\n df_train post resample: \n')
      df_train_test = df_train
      y_train_test = df_train[,c(label_column)]
      prediction <- predict(model, df_train_test)
            
      df_postResample<-postResample(pred = prediction, obs = y_train_test)
      print(df_postResample)
    
      plot(y_train_test,prediction,xlab = 'Label',ylab = 'Prediction', col='blue', main = model_name)
      par(new=FALSE)
      
      cat('\n df_test post resample: \n')
      df_train_test = df_test
      y_train_test = df_test[,c(label_column)]
      prediction <- predict(model, df_train_test)
      df_postResample<-postResample(pred = prediction, obs = y_train_test)
      print(df_postResample)
      
      points(y_train_test,prediction,xlab = 'Label',ylab = 'Prediction', col='red')
      legend('topleft', legend = c('Train','Test'), col=c('blue','red'), pch='o')
    }
}

model_list <- list(model_svmLinear, model_rf, model_gbTree)
compare_models(model_list, df_train, df_test, label_column = 'log10_Volume')

```

<span style='color:red'> The random forest and xgbTree appear to perform best; however, the xgbTree seems to be overfitting, based on its R-squared score and the nearly perfect linear correlation between the label and the prediction in the label-vs-prediction plot, so we will proceed with the random forest as the best model. </span>

5. After choosing a model, you will need to prepare the new products data set for prediction. Anything that has been done to the structure of the existing products data needs to be repeated for new products. With new products, use dummyVars() and then remove any attribute that you removed from the existing products data set. When using dummyVars, be sure to change the name of the object you are creating so that you don't overwrite your earlier work. Example: newDataframe should be changed to newDataframe2 wherever it appears in your dummyVars work. 

```{r}
#load the data set
df_validation <- read_csv("newproductattributes2017.csv");

# one-hot encode (dummify) the data
df_validation_preprocessed <- dummyVars(" ~.",data = df_validation)
df_validation_preprocessed <- data.frame(predict(df_validation_preprocessed,newdata = df_validation))

# drop the NA-containing column (BestSellersRank) and the features removed during preprocessing
drops <- c("BestSellersRank","x1StarReviews","x5StarReviews")
df_validation_preprocessed <- df_validation_preprocessed[,!(names(df_validation_preprocessed) %in% drops)]

# Make predictions with the chosen model (random forest)
model <- model_rf
df_train_test = df_validation_preprocessed
prediction <- predict(model, df_train_test)
prediction_validation <- prediction

prediction_validation_Volume <- 10^(prediction_validation)
```
```{r}
#Add predictions to df
df_validation_w_predictions <- df_validation
df_validation_w_predictions['Predicted_Volume'] <- prediction_validation_Volume

#sort the df
df_validation_w_predictions <- df_validation_w_predictions[order(df_validation_w_predictions$Predicted_Volume),]

#Add unique ID column
df_validation_w_predictions['ProductType_ProductNumber_Price']<- with(df_validation_w_predictions, paste0(ProductType,'_#', ProductNum,'_$', Price))

par(mar=c(11,4,1,1))
barplot(height = df_validation_w_predictions$Predicted_Volume, names.arg = df_validation_w_predictions$ProductType_ProductNumber_Price, las=2, cex.axis = .8 , cex.names = 0.8, ylab = 'Volume')

#aggregate by Product Type
df_ProductType_aggregate <- aggregate(df_validation_w_predictions$Predicted_Volume, by=list(Category=df_validation_w_predictions$ProductType), FUN=sum)
colnames(df_ProductType_aggregate) <- c("ProductType", "Total_Predicted_Volume")

# sort the aggregate
df_ProductType_aggregate <- df_ProductType_aggregate[order(df_ProductType_aggregate$Total_Predicted_Volume),]

par(mar=c(9,4,1,4))
barplot(height = df_ProductType_aggregate$Total_Predicted_Volume, names.arg = df_ProductType_aggregate$ProductType, las=2, ylab = 'Total Volume')
```

```{r}
#Plot Ratings and Reviews vs. Volume
x<-log10(df_validation_w_predictions$x4StarReviews)
y<-log10(df_validation_w_predictions$Predicted_Volume)
plot(x,y,col='red',xlab = "log10(# of ratings)", ylab = "log10(Predicted_Volume)")

x<-log10(df_validation_w_predictions$x3StarReviews)
points(x,y,col='green')

x<-log10(df_validation_w_predictions$x2StarReviews)
points(x,y,col='blue')
legend(2.1,2.2, legend = list('4 Stars','3 Stars','2 Stars'),col=c('red','green','blue'),pch='o')

#Plot service reviews
x<-log10(df_validation_w_predictions$PositiveServiceReview)
y<-log10(df_validation_w_predictions$Predicted_Volume)
plot(x,y,col='green',xlab = "log10(# of Service Reviews)", ylab = "log10(Predicted_Volume)")

x<-log10(df_validation_w_predictions$NegativeServiceReview)
points(x,y,col='red')

legend('topright', legend = c('Positive','Negative'), col=c('green','red'), pch='o')

```

6. Once the new products data set is prepared, use the predict() function again, this time with the new products dataset, to create your final predictions in an object called finalPred.


Oftentimes it is helpful for report building to output your data set and predictions from RStudio. Let's add your predictions to the new products data and then create a CSV file. Use your CSV file and Excel to organize your data for reporting.

* Add predictions to the new products data set 
     
<span style='color:red'> This was done in the previous part </span>
        
* Create a csv file and write it to your hard drive. Note: You may need to use your computer’s search function to locate your output file. 
     
```{r}
write.csv(df_validation_w_predictions, file="C2.T3output.csv", row.names = FALSE)
```

3. Use Excel to organize your predictions. Remember the four product types you need to focus on: PC, Laptops, Netbooks and Smartphones

<span style='color:red'> We'll just stick to organizing the data in R </span>

***
***
## 3. Write an informal report
Write an informal report to Danielle Sherman, in Word or PowerPoint, describing your analysis. In addition to presenting your findings, you might address questions such as the following:

* Did you learn anything of potential business value from this analysis?
* Was it straightforward to rerun your projections of sales volume using both models? 
* What are the main lessons you've learned from this experience?
* What recommendations would you give to the sales department regarding your findings relating to the different types of reviews? 

***
***
## <span style='color:red'> Multiple Regression in R Report </span>

### Introduction
In this report we review the multiple regression techniques used to predict sales volumes for new products from Blackwell Electronics. In developing these multiple regression models, we performed a number of preprocessing steps: we eliminated features with high collinearity, scaled the data so all features had similar numeric ranges, one-hot encoded (dummified) categorical string data so the models could leverage these categorical features, and transformed the label of interest, Volume, onto a log scale to prevent the models from ever predicting volumes below 0.

### Preprocessing
The "existingproductattributes2017.csv" data set was used to build the multiple regression models. After pulling the data into R, the categorical feature, Product Type, was one-hot encoded (dummified). In this step, the "dummyVars" function determines the number of product-type categories, n_cat, and creates n_cat new columns, one per product type. For each data row, a one is placed in the column matching that row's original product type and zeros are placed in the remaining columns. In this way, we transformed string-based categorical data into numeric data, which the machine learning algorithms can leverage as features during training and prediction.
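As a minimal illustration of this encoding (using base R's model.matrix() on a made-up ProductType column rather than caret's dummyVars() on the real data), each row receives a single 1 in the column for its category and 0s elsewhere:

```{r}
# Toy categorical column, for illustration only
df <- data.frame(ProductType = c("Laptop", "PC", "Laptop", "Smartphone"))

# "~ ProductType - 1" drops the intercept so every category gets its own 0/1 column
onehot <- model.matrix(~ ProductType - 1, data = df)
onehot
```

Note that each row of the result sums to exactly 1, since each observation belongs to exactly one category.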

Following the one-hot encoding, we analyzed the data to see whether any columns contained "NA" values. "BestSellersRank" was observed to contain 15 NA cells, so this column was dropped as a feature.
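A check along these lines can be sketched with base R on a toy data frame (the column names here mirror the real data, but the values are made up):

```{r}
# Toy data frame with one complete column and one NA-containing column
df <- data.frame(Price = c(99, 149, 199),
                 BestSellersRank = c(12, NA, NA))

# Count NA values per column; any column with a nonzero count is a candidate to drop
na_counts <- colSums(is.na(df))
na_counts

# Keep only the columns with no NA values
df_clean <- df[, na_counts == 0, drop = FALSE]
names(df_clean)
```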

Next, we analyzed the correlations in the data set using a correlation plot, shown below.

!["Correlation Plot"](figures/correlation_plot.png)
Here, the deep blue cells represent strong positive correlations, while the deep red cells represent strong negative correlations. Using the correlation values from this table, we filtered out the features with collinearities >0.95 ("x1StarReviews" with "x2StarReviews"). Furthermore, we discovered that the "x5StarReviews" feature had a perfect correlation of 1 with the "Volume" label. This is a suspiciously good correlation between a feature and a label, so we examined the data in a scatter plot, shown below.

!["Volume vs x5StarReviews"](figures/Volume_vs_5starReviews.png)
As can be seen, this feature and our label of interest have a perfect correlation, which implies there is likely some data entry error in the "x5StarReviews" feature. For this reason, this column was dropped as a feature for our models.

Following the exclusion of the "x1StarReviews" and "x5StarReviews" features, we transformed the label column ("Volume") onto a log10 scale. This was done to prevent any of the models from ever predicting negative volumes, since a negative prediction for log10(Volume) simply corresponds to a volume of less than 1 (e.g., a prediction of -1 corresponds to 10^-1 = 0.1).
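The guarantee can be demonstrated directly: no matter how negative a prediction on the log10 scale is, the back-transformed volume is always positive.

```{r}
# Hypothetical model outputs on the log10 scale, including negative values
log10_predictions <- c(-2, 0, 1.5, 3)

# Back-transform to the original volume scale
volumes <- 10^log10_predictions
volumes

# Every back-transformed volume is strictly positive
all(volumes > 0)
```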

Finally, the data was split into a training and testing set using a 75-25% train-test partition. 

### Training & Testing the Models
Three models were evaluated: (1) a support vector machine (SVM) with a linear kernel, (2) a random forest (RF), and (3) an eXtreme gradient boosted tree (xgbTree). For each model, the train and test RMSE and R-squared were calculated. The table below summarizes the results.

!["Train Test Metrics"](figures/Train_test_metrics.png)

To visualize the results, we also plotted the true log10(Volume) label vs. the log10(Volume) prediction.
!["Label vs. prediction plots"](figures/label_vs_prediction_plots.png)
Viewing the RMSE & R-squared summary table, along with the label vs. prediction plots, we can see that the RF and xgbTree are the best-performing models. However, comparing these two models, we also see that the xgbTree has a training R-squared nearly equal to one (0.9999) and the trend in its label vs. prediction plot is almost perfectly linear on the training set. These two facts suggest the model is overfitting the training data, and thus the RF is the better model for generalization. 

### Predicting New Product Volumes
Using the trained RF model, we performed predictions for the new products defined in the "newproductattributes2017.csv" data set. Prior to feeding the data into the trained model, we carried out the previously mentioned preprocessing steps (one-hot encoding the product type and dropping the "BestSellersRank", "x1StarReviews", and "x5StarReviews" columns). After predicting log10(Volume) for each case in the new product attributes table, we back-transformed the predictions to recover the predicted volumes. The bar chart below shows the total (aggregate) predicted volume by product type.

!["Total_Volume_vs_Product_Type"](figures/Total_Volume_vs_Product_Type.png)
Here, we can see that Tablets and Game Consoles are expected to have the highest sales volumes. Diving deeper into the data, we can break down the products further by product type, product number, and price. The bar chart below shows this breakdown.

!["Volume_breakdown"](figures/Volume_breakdown.png)
From this, we can more clearly see which unique products are expected to have the highest sales volumes. Specifically, we see that Tablet #187, sold at $199, contributes the majority of the total tablet volume, while Game Consoles #307 and #199 contribute nearly equally to the total game console volume.

These conclusions have two key business implications: (1) if the sales objective is to minimize the number of products while maximizing sales volume, then focusing on Tablet #187 is the best course of action; (2) if the sales objective is to offer the widest range of product types while maximizing sales volume, then the team should focus on PC#17, Tablet #186, Smartphone #194, Netbook #180, Game Consoles #307 and/or #199, and Tablet #187.

Finally, the last prediction that may be of interest to the sales team is the impact of customer ratings and service reviews on volume. The scatter plot below shows the predicted volume vs. the number of 4-star, 3-star, and 2-star ratings, all on a log-log scale.

!["Volume_breakdown"](figures/Volume_vs_ratings.png)
Here, we can see an essentially linear relationship between log10(# of ratings) and log10(predicted volume).

Similar to the number of ratings, we also see a linear relationship between log10(Volume) and log10(# of service reviews), both positive and negative, as can be seen in the plot below.
!["Volume_breakdown"](figures/Volume_vs_service_reviews.png)
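A straight line on a log-log plot corresponds to a power-law relationship, Volume ≈ k * (# reviews)^m, whose exponent can be recovered by fitting lm() to the logged values. A sketch on synthetic data (the coefficients 5 and 0.8 are invented for illustration, not estimated from the Blackwell data):

```{r}
set.seed(1)

# Synthetic power-law data: volume = 5 * reviews^0.8, with mild log-normal noise
reviews <- 10^runif(60, 0, 3)                  # 1 to 1000 reviews
volume  <- 5 * reviews^0.8 * 10^rnorm(60, 0, 0.05)

# Fitting a line on the log-log scale recovers the exponent as the slope
fit <- lm(log10(volume) ~ log10(reviews))
coef(fit)  # slope should be close to the true exponent 0.8
```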

### Task 2 Opinions/Comments
Overall I found this to be the most challenging of the tasks we have completed, largely because it required more individual exploration rather than following the plan of attack line by line. I think this style of activity was very educational. The part I had the most trouble with was running the initial linear model, as the errors the model was throwing were somewhat strange and there wasn't a consistent answer online as to what they actually mean. Other than that, I found it pretty straightforward to rerun predictions using different models, though I wish R had better function capabilities, more similar to Python, because I found myself copying and pasting lines of code rather than dealing with the unique characteristics of R functions.
